The Skill IQ and Role IQ tests are addictive. I haven’t yet used Pluralsight to learn and improve my technical skills, but I can see how the assessments would drive interaction and keep subscribers coming back to improve. What a fun way to encourage personal and professional development!

Data Exploration Questions

1. Describe and visualize how the distributions of user and question rankings compare and relate between assessments.

User Ranking Distributions

Overall Ranking Metrics

Interaction Ranking Metrics

Question Ranking Distributions

Question Ranking Metrics by Assessment

2. How does it appear the algorithm determines when a user’s assessment session is complete?

We can evaluate how the algorithm decides to stop asking questions by plotting a time series of each assessment session. The obvious guess is a minimum threshold on the question-to-question change in the rd value. Something very close to that guess is confirmed by observing a random sample of several user_assessment_session_ids.

It’s worth checking the other metrics associated with a session (display_score, percentile, and ranking) to confirm our suspicion that rd is the main variable driving the algorithm. In the plots below of the same three assessment sessions, we see that rd is the only metric of the four that seems a plausible candidate.

A closer look at the distribution of the minimum rd value reached in each assessment session shows that a simple threshold of 80 drives the stopping rule: over 75% of sessions stopped at an rd value just below 80. While 80 seems like an arbitrary value to me, I am sure some empirical and theoretical work went into choosing it. And although 75% may seem low, that figure includes all sessions, even those stopped prematurely by the user (as discussed in #3).

##        0%        5%       10%       15%       20%       25%       30% 
##  77.99380  78.34009  78.42960  78.51234  78.59056  78.65580  78.70929 
##       35%       40%       45%       50%       55%       60%       65% 
##  78.76540  78.82022  78.87147  78.93500  79.00305  79.08484  79.19052 
##       70%       75%       80%       85%       90%       95%      100% 
##  79.34391  79.65148  94.55070 124.50190 156.89920 202.17270 256.61200
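The stopping-rule check described above can be sketched in code. This is a minimal illustration with synthetic data (the analysis itself was done in R; the column name user_assessment_session_id follows the dataset, but the rd values here are invented):

```python
from collections import defaultdict

# Hypothetical interaction log: one entry per question answered in a session,
# as (user_assessment_session_id, rd after the question).
interactions = [
    (1, 250.0), (1, 120.0), (1, 79.4),
    (2, 240.0), (2, 155.0),
    (3, 230.0), (3, 110.0), (3, 85.0), (3, 78.9),
]

# Minimum rd reached in each session.
min_rd = defaultdict(lambda: float("inf"))
for session_id, rd in interactions:
    min_rd[session_id] = min(min_rd[session_id], rd)

# Sessions whose rd dropped below the apparent stopping threshold of 80.
THRESHOLD = 80
stopped_by_rule = {s for s, rd in min_rd.items() if rd < THRESHOLD}
print(sorted(stopped_by_rule))  # [1, 3]
```

Session 2 never reached the threshold, matching the pattern of sessions that were abandoned before the algorithm stopped them.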

3. Which of the assessments has the highest and lowest dropout rates, respectively?

## # A tibble: 2 x 2
##   rd_threshold `n()`
##          <dbl> <int>
## 1            0  1608
## 2            1  5070
## # A tibble: 32 x 3
## # Groups:   rd_threshold [?]
##    rd_threshold n_questions_answered `n()`
##           <dbl>                <int> <int>
##  1            0                    0   271
##  2            0                    1   211
##  3            0                    2   189
##  4            0                    3   165
##  5            0                    4   112
##  6            0                    5   127
##  7            0                    6   101
##  8            0                    7   100
##  9            0                    8    70
## 10            0                    9    54
## # ... with 22 more rows
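One way to turn the threshold from question 2 into dropout rates is to label a session as dropped out when its rd never fell below 80, then aggregate by assessment. A rough Python sketch with invented session summaries:

```python
from collections import defaultdict

# Hypothetical per-session summary: (assessment title, minimum rd reached).
sessions = [
    ("Python",     78.5), ("Python",     150.0), ("Python", 79.0),
    ("JavaScript", 79.2), ("JavaScript", 210.0),
    ("SQL",        78.8), ("SQL",         79.1), ("SQL",    78.6), ("SQL", 190.0),
]

THRESHOLD = 80  # apparent stopping threshold from question 2

counts = defaultdict(lambda: [0, 0])  # assessment -> [completed, dropped out]
for title, min_rd in sessions:
    counts[title][0 if min_rd < THRESHOLD else 1] += 1

dropout_rate = {t: dropped / (done + dropped) for t, (done, dropped) in counts.items()}
print(dropout_rate)
```

Sorting `dropout_rate` would then surface the assessments with the highest and lowest rates.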

4. Is there significant variance in question difficulty by topic within a given assessment?

5. How many times must a question be answered before it reaches its certainty floor? Does that number appear to be constant or does it vary depending on question or assessment?

There are 724 questions in the dataset. I expect the rd metric to again indicate the certainty floor, and a quick look at the distribution of rd values shows that floor to be 30. However, many (%) of the assessment_item_ids show all of their rd values equal to 30, perhaps because those are older questions that reached the floor before the start of this dataset.
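Counting how many answers a question needs before reaching the floor could look like the following minimal sketch. The rd histories are synthetic; the floor of 30 comes from the distribution described above:

```python
# Hypothetical per-question rd histories, ordered by time answered.
rd_history = {
    "q1": [350.0, 180.0, 95.0, 52.0, 30.0, 30.0],
    "q2": [350.0, 200.0, 120.0, 80.0, 55.0, 41.0, 30.0],
    "q3": [30.0, 30.0, 30.0],  # at the floor for the whole dataset
}

FLOOR = 30.0

def answers_to_floor(rds):
    """1-based index of the first interaction at the certainty floor, or None."""
    for i, rd in enumerate(rds, start=1):
        if rd <= FLOOR:
            return i
    return None

result = {q: answers_to_floor(rds) for q, rds in rd_history.items()}
print(result)  # {'q1': 5, 'q2': 7, 'q3': 1}
```

Questions like "q3", already at the floor throughout the dataset, are exactly the censored cases that make this count hard to pin down from these data alone.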

We would really like to look at all 724 of these questions. We can examine much of the structure using trelliscopejs, a tool for interactively viewing a large collection of visualizations. The key advantage of trelliscope is that it builds a rich set of features (cognostics) for sorting and filtering the panels, helping us spot nuances, outliers, and important characteristics of the data.

A brief description of the cognostics (features) is available by clicking the “i” in the upper left corner, and you can search for interesting assessment_item_ids using the Sort and Filter buttons on the left-hand side. For example:

  • To see the assessment_item_ids with rd values other than 30, click the Filter button, then the “All RD values = 30” pill, and enter “0” on the right side. This reduces the total number of panels from 724 to 209.
  • To keep only panels (plots) where at least two points are present (and thus a plot is drawn), stay in the Filter view, click the “Number of Question Interactions” pill, and enter 2 on the left-hand side of the range selection. This immediately removes the blank panels (not plotted because only one observation exists) and reduces the panel count from 209 to 180.
  • Clicking the Filter button again closes that window. You can sort or filter further to test hypotheses or explore the data sliced by assessment_item_id. Happy exploring!

Obviously an rd value of 30 is important and relevant, but I didn’t find anything else that gave me sufficient confidence to answer this question explicitly, beyond saying there appears to be plenty of variation between questions.

More Involved/Open-ended Questions

1. Identify a metric that could be used to identify questions that are performing poorly, and consequently might need to be reviewed, changed, or removed.

  • Questions that are almost always answered incorrectly, especially when the question’s difficulty is comparatively low. (Some questions are likely purposely difficult, so one expects those to rarely receive a correct response.)
  • Questions that substantially increase the rd metric (though that may be a function of question order).
  • A scatterplot comparing the rd change attributable to a question against the user’s current percentile could identify outliers: cases where the rd change is large and negative while the percentile was low.
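As a rough sketch of the first idea, one could flag questions whose observed correct-answer rate sits far below what their difficulty would predict. The difficulty scale, the linear expectation, and the 0.4 cutoff below are all invented for illustration:

```python
# Hypothetical question stats: (assessment_item_id, difficulty in [0, 1] where
# higher = harder, observed fraction of correct responses).
stats = [
    ("q101", 0.20, 0.78),
    ("q102", 0.50, 0.49),
    ("q103", 0.30, 0.12),  # easy question almost never answered correctly
    ("q104", 0.85, 0.18),
]

def expected_correct(difficulty):
    # Naive assumption: expected correct rate falls linearly with difficulty.
    return 1.0 - difficulty

# Flag questions whose observed correct rate falls far below expectation.
flagged = [
    qid for qid, diff, observed in stats
    if expected_correct(diff) - observed > 0.4
]
print(flagged)  # ['q103']
```

Note that "q104" is not flagged even though it is rarely answered correctly, because its high difficulty predicts exactly that.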

2. Suppose an update to Python causes a question’s answer to change, but our question authors don’t notice, and the now-outdated question remains in the test. How might that scenario reveal itself in the data?

Hopefully it reveals itself through the question being answered incorrectly far more often than expected. That may not hold for more experienced or long-time users of that technology/language, so one might need to account for that somehow. I also noticed a link at the bottom of the page, shown after the answer is revealed, that gives users a way to flag exactly this kind of situation.
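A simple before/after comparison around the release date of the update would make such a question stand out. A sketch with fabricated response data and a hypothetical cutoff date:

```python
from datetime import date

# Hypothetical response log for one question: (date answered, correct?).
responses = [
    (date(2018, 1, 5), True),  (date(2018, 2, 2), True),
    (date(2018, 3, 1), True),  (date(2018, 4, 9), False),
    # A language update ships in May; answers mostly flip afterward.
    (date(2018, 6, 3), False), (date(2018, 7, 7), False),
    (date(2018, 8, 1), False), (date(2018, 9, 4), True),
]

cutoff = date(2018, 5, 1)  # hypothetical release date of the update

def correct_rate(rows):
    return sum(ok for _, ok in rows) / len(rows)

before = correct_rate([r for r in responses if r[0] < cutoff])
after = correct_rate([r for r in responses if r[0] >= cutoff])
print(before, after)  # 0.75 0.25
```

A sharp drop in the correct rate immediately after a known release date, rather than a gradual drift, is the signature this scenario would leave.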

3. Given your response to number 2 in the Data Exploration Questions above, what is a method we could use to determine ideal points to stop a user’s assessment session (i.e. identify the right balance between certainty and burden on the user)?

I suppose you could account for the distribution/curve of a user’s previous assessments. For example, if they have taken several assessments before the current one, you may be able to predict/extrapolate the final score and ranking from their position partway through the assessment.

Taking that a step further, why not treat each step of an assessment (for a given topic) as a modeling and prediction opportunity by training a model on the eventual outcomes of past assessments? That way the thousands (or millions) of prior assessments for that topic could generate a prediction, and the assessment could stop once the prediction reaches a certain accuracy threshold per the model. To be clear, I am imagining a separate deep learning model (or any predictive model) for each number of questions answered within a topic: one model trained on sessions after five questions answered, another after six, and so on.
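As a toy version of that idea, here is one least-squares model per number of questions answered, standing in for the deep learning models described above. All scores are invented:

```python
import numpy as np

# Hypothetical training data: each row is a finished session's running display
# scores after questions 1..6, plus the final score the session ended on.
running = np.array([
    [100, 120, 135, 150, 158, 165],
    [100,  90,  85,  88,  92,  95],
    [100, 140, 170, 190, 205, 215],
    [100, 110, 118, 121, 126, 130],
], dtype=float)
final = np.array([170.0, 97.0, 220.0, 133.0])

# One model per "number of questions answered": plain least squares here,
# but any predictive model could slot in.
models = {}
for k in range(1, running.shape[1] + 1):
    X = np.hstack([running[:, :k], np.ones((len(running), 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(X, final, rcond=None)
    models[k] = coef

def predict(partial_scores):
    """Predict the final score from the first k running scores."""
    x = np.append(partial_scores, 1.0)
    return float(x @ models[len(partial_scores)])

print(round(predict([100.0, 115.0, 130.0]), 1))
```

The assessment would stop at the first k where the k-question model's historical prediction error drops below an acceptable tolerance, trading certainty against burden on the user.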

4. How could we calculate the overall difficulty level of a particular topic? How might we then calculate a topic-level score for a single user?

You may get close by determining which combinations of topics tend to be taken together. If a set of users is prone to take the same five topic assessments (and rarely others), you could see which of those topics was most difficult for that group. For example, business analysts may consistently take the data warehousing, data analytics/visualization, SQL, and Python assessments, and often score lower on the Python assessment.
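A paired, within-user comparison is one way to sketch this: for each topic, average how far a user’s percentile on that topic falls below the same user’s mean across the topics they took. The scores below are invented:

```python
from collections import defaultdict

# Hypothetical percentile scores: user -> {topic: percentile}.
scores = {
    "u1": {"SQL": 72, "Python": 41, "Data Analytics": 68},
    "u2": {"SQL": 65, "Python": 38},
    "u3": {"SQL": 80, "Python": 55, "Data Analytics": 74},
}

# For each topic, collect the gap between a user's mean percentile and their
# percentile on that topic; a large positive average gap = a hard topic.
gaps = defaultdict(list)
for topics in scores.values():
    mean = sum(topics.values()) / len(topics)
    for topic, pct in topics.items():
        gaps[topic].append(mean - pct)

relative_difficulty = {t: sum(g) / len(g) for t, g in gaps.items()}
print(relative_difficulty)
```

Because each comparison is within a single user, this controls for overall ability; in the toy data, Python comes out hardest for this cohort. A user’s topic-level score could then be their percentile adjusted by the topic’s average gap.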

I wonder if the frequency with which a topic is assessed is an indicator of its difficulty. Certainly frequency relates to the popularity and general demand/usefulness of a topic, as well as its newness (newer tools/technologies/languages may be assessed less frequently, following an adoption curve). Fortran or other older languages/technologies may be considered more difficult simply because fewer modern learning resources exist for them.

How is “difficult” defined here?